In a pre-test-post-test cluster randomized trial, one of the methods commonly used to 
detect an intervention effect involves controlling pre-test scores and other related 
covariates while estimating an intervention effect at post-test. In many applications in 
education, the total post-test and pre-test scores, ignoring measurement error, are used 
as response variable and covariate, respectively, to estimate the intervention effect. 
However, these test scores are frequently subject to measurement error, and statistical 
inferences based on the model ignoring measurement error can yield a biased estimate of 
the intervention effect. When multiple domains exist in test data, it is sometimes more 
informative to detect the intervention effect for each domain than for the entire test. This 
paper presents applications of the multilevel multidimensional item response model with 
measurement error adjustments in a response variable and a covariate to estimate the 
intervention effect for each domain. 


1. Introduction 

Pre-test-post-test cluster randomized trials are common in educational intervention 
studies because researchers cannot control students’ class assignment, although random 
assignment sometimes occurs at the student level as well (Raudenbush, 1997). Thus, 
study designs have multilevel data in which teachers, classes or schools are randomly 
assigned to intervention. One of the commonly used methods for detecting an 
intervention effect involves controlling pre-test scores and other related covariates when 
estimating the intervention effect at post-test (e.g., Aitkin & Longford, 1986; Goldstein, 
2003, ch. 2). 

Students’ ability scores at pre-test and post-test are vulnerable to measurement error,¹ 
and ability is often measured with a set of items. It has been shown that ignoring 
measurement error in a response variable (i.e., post-test scores) and a covariate (i.e., 
pre-test scores) leads to biased parameter estimates. The bias is due to attenuation from 
measurement error in the response variable (e.g., Carroll, Ruppert, Stefanski, & 
Crainiceanu, 2006, ch. 15; Fox, 2004). Measurement error in the covariate is also 
responsible for biased parameter estimates and loss of power to detect relationships 
among variables (Bryk & Raudenbush, 1992; Carroll et al., 2006; Fox & Glas, 2003; 
Goldstein, Kounali, & Robinson, 2008; Rabe-Hesketh, Skrondal, & Pickles, 2004). In 
detecting an intervention effect controlling for pre-test scores, the effects of pre-test scores 
can be biased in the presence of measurement error in pre-test scores (e.g., Lüdtke, Marsh, 
Robitzsch, & Trautwein, 2011). However, for the intervention effect, previous research 
has shown that covariate measurement error is a problem only for non-experimental 
designs with groups that differ in average covariate value in analysis of covariance (e.g., 
Culpepper & Aguinis, 2011; Porter & Raudenbush, 1987). When there is no group 
difference in pre-test scores, bias in the intervention effect estimate may not be of concern 
in the presence of measurement error in pre-test scores (Cho & Preacher, 2015). The 
assumption that there is no group difference in pre-test scores must be tested. Item 
response models can be used to model the relationship between ability and the set of 
individual items when ability cannot be measured perfectly. 

¹ In this study, we use the term ‘measurement error’ to refer to random measurement error, not systematic 
measurement error. 
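The two claims above (an attenuated pre-test slope, but an unaffected intervention effect when groups do not differ at pre-test) can be illustrated with a small simulation. This is only a single-level sketch under assumed values: the reliability of .7, the true slope of 0.8, the true effect of 0.5, and the function name `simulate_ancova` are all hypothetical and are not taken from the study.

```python
import numpy as np

def simulate_ancova(n=20000, reliability=0.7, effect=0.5, seed=1):
    """Simulate a randomized pre-test/post-test design and fit ANCOVA by OLS.

    Returns the estimated (treatment effect, pre-test slope) when the
    pre-test is observed with random measurement error.
    """
    rng = np.random.default_rng(seed)
    true_pre = rng.normal(0.0, 1.0, n)          # error-free pre-test ability
    trt = rng.integers(0, 2, n).astype(float)   # random assignment: no group difference at pre-test
    post = effect * trt + 0.8 * true_pre + rng.normal(0.0, 0.6, n)
    # Observed pre-test = true score + random error, scaled to the chosen reliability.
    error_sd = np.sqrt((1.0 - reliability) / reliability)
    observed_pre = true_pre + rng.normal(0.0, error_sd, n)
    X = np.column_stack([np.ones(n), trt, observed_pre])
    coef, *_ = np.linalg.lstsq(X, post, rcond=None)
    return coef[1], coef[2]

trt_hat, slope_hat = simulate_ancova()
print(f"treatment effect: {trt_hat:.3f} (true value 0.5)")
print(f"pre-test slope:   {slope_hat:.3f} (true value 0.8, expected near 0.8 * 0.7)")
```

Under these assumptions the slope shrinks by roughly the reliability, while the treatment coefficient stays close to its true value because randomization makes the treatment indicator independent of the pre-test.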

In addition, students’ outcomes in the evaluation of intervention studies often involve 
multiple domains even though the test is supposedly unidimensional. The multiple- 
domain design provides the possibility of detecting intervention effects for each domain 
and thus facilitates diagnostic interpretations of the results. To do so, separate 
unidimensional item response models can be fitted to obtain item response theory 
(IRT) scale scores and an intervention effect on the scale of each domain. However, this 
approach can lead to inaccurate results when the number of test items related to each 
domain is small (e.g., de la Torre, Song, & Hong, 2011). 

Multilevel multidimensional item response models (MMIRMs; Muthén & Asparouhov, 
2013; Rabe-Hesketh et al., 2004) allow for explicitly modelling measurement error and 
IRT subscoring for multilevel data. The MMIRM provides the opportunity to model latent 
variables with multiple observed items to reduce the effects of measurement error. In 
addition, multiple latent variables for multiple domains are modelled, and the linear 
relationship between the domain-specific latent variables can be obtained at each level of 
multilevel data in the MMIRM. 

Measurement error adjustment is achieved by applying the MMIRM to response 
variables and covariates. Up to this point, MMIRMs have been mainly applied to response 
variables (see Muthén & Asparouhov, 2013, sections 7 and 8). There are examples of 
researchers correctly accounting for measurement error in covariate(s) using 
unidimensional item response models (Battauz & Bellio, 2011; Fox & Glas, 2003). There are also a 
few examples of measurement error adjustment in response variables and covariates. 
Raudenbush and Sampson (1999) used a multilevel Rasch model to control for 
measurement error in both response variables and covariates. Rabe-Hesketh et al. 
(2004, equation 18, p. 180) specified the linear predictor in a generalized linear model for 
measurement error adjustment in response variables and covariates in multilevel data. 
When a measurement model is specified for both response variables (i.e., post-test scores) 
and covariates (i.e., pre-test scores), latent variables for the covariates are used to explain 
latent variables for response variables at each level of the multilevel data. This makes 
symmetric score mapping possible between post-test scores and pre-test scores.² 
However, to our knowledge, the multidimensional specification of the linear predictor 
with a logit link or probit link (two-parameter MMIRM) has not been applied to adjust 
measurement error in a response variable and a covariate. 

When an MMIRM as a latent covariate³ (i.e., a pre-test model) is added to an MMIRM as 
a response variable (i.e., a post-test model), other manifest covariates including a grouping 
variable for the intervention (i.e., a control group vs. a treatment group) and demographic 
variables can be added to account for the ability parameters of the post-test model in a 
structural model. All parameters in the measurement models and the structural model can 
be estimated simultaneously in explanatory item response modelling (De Boeck & Wilson, 
2004) or generalized multilevel structural equation modelling (McDonald, 1993; Muthén, 
1994; Rabe-Hesketh et al., 2004). 

² The authors thank the reviewer of a previous version of this paper for clearly pointing out this modelling feature. 

³ We define the term latent covariate as a covariate measured with measurement error, in contrast to a manifest 
covariate measured without measurement error. 

The purpose of this paper is to present a model specification that includes a two- 
parameter MMIRM in which measurement error is corrected for a response variable and a 
covariate at each level of the multilevel data structure. The MMIRM in this study is an 
MMIRM with a multilevel latent covariate (MMIRM-MLC). The rest of this paper is 
organized as follows. First, we specify an MMIRM-MLC and describe parameter estimation. 
Then we present an empirical study for applications of the MMIRM-MLC, followed by a 
simulation study to evaluate an MMIRM-MLC and to compare its performance with other 
approaches using the total scores. We conclude with a summary and discussion. 


2. MMIRM with a multilevel latent covariate 

In this section an MMIRM-MLC is described, with a measurement model and a structural 
model, for binary responses. Crossed and nested data structures are possible in multilevel 
item response data at pre-test and post-test. If every item is offered to all individuals and 
every individual responds to all items, the item and individual classifications are found at 
the same level, and they are crossed. In addition to the crossed design, there is a multilevel 
design in which individuals (e.g., students) are nested within clusters (e.g., teachers). To 
frame this data structure within the multilevel literature (e.g., Bryk & Raudenbush, 1992), 
item responses at level 1 are cross-classified with individuals and items at level 2. 
Individuals are nested within clusters at level 3. The model description is limited to a 
between-item design in which an item is loaded on one dimension or latent variable for 
subscoring. 

A measurement model, an MMIRM, for correct item responses at post-test (denoted by 
a subscript 2) is as follows, assuming that there is no evidence of measurement bias 
regarding clusters and groups (e.g., control and treatment groups): 

$$P(y_{2jki} = 1 \mid \theta_{2jk}, \theta_{2k}) = \Phi[\alpha_{2i}' (\theta_{2jk} + \theta_{2k}) - \beta_{2i}], \qquad (1)$$

where $\Phi$ denotes the standard normal cumulative distribution function, $j$ is an index for an 
individual ($j = 1, \ldots, J$), $k$ is an index for a cluster ($k = 1, \ldots, K$), $i$ is an index for an 
item ($i = 1, \ldots, I$), $d$ is an index for a dimension (i.e., domain) ($d = 1, \ldots, D$), 
$y_{2jki} = [y_{2jki1}, \ldots, y_{2jkid}, \ldots, y_{2jkiD}]'$ are item responses across domains at post-test, 
$\theta_{2jk(D \times 1)} = [\theta_{2jk1}, \ldots, \theta_{2jkd}, \ldots, \theta_{2jkD}]'$ are multidimensional latent variables at level 2, 
$\theta_{2k(D \times 1)} = [\theta_{2k1}, \ldots, \theta_{2kd}, \ldots, \theta_{2kD}]'$ are multidimensional latent variables at level 3, 
$\alpha_{2i(D \times D)}$ are item slopes or item discrimination parameters at post-test, and $\beta_{2i(D \times 1)}$ 
are item intercepts or item difficulty parameters at post-test. $\theta_{2jk}$ and $\theta_{2k}$ are assumed 
to follow a multivariate normal distribution, $\theta_{2jk} \sim MN(0_{(D \times 1)}, \Sigma_{1(D \times D)})$ and 
$\theta_{2k} \sim MN(0_{(D \times 1)}, \Sigma_{2(D \times D)})$, respectively. 
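As a minimal sketch of equation 1, the function below evaluates the normal-ogive probability for a single item. It assumes the between-item design described above, so that the matrix expression reduces to the scalars of the one dimension on which the item loads; the function name `prob_correct` and the numerical values are hypothetical.

```python
from statistics import NormalDist

def prob_correct(alpha, beta, theta_student, theta_cluster):
    """Normal-ogive probability of a correct response for one item.

    alpha and beta are the item slope and intercept; theta_student and
    theta_cluster are the level-2 and level-3 latent variables of the
    dimension on which the item loads.
    """
    return NormalDist().cdf(alpha * (theta_student + theta_cluster) - beta)

# Illustrative (made-up) parameter values.
p = prob_correct(alpha=1.2, beta=0.3, theta_student=0.5, theta_cluster=-0.1)
print(round(p, 3))
```

The same function applies to the pre-test model by passing the pre-test item parameters and latent variables.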

A measurement model, an MMIRM, for correct item responses at pre-test (denoted by a 
subscript 1) is as follows, assuming that there is no evidence of measurement bias 
regarding clusters and groups: 
$$P(y_{1jki} = 1 \mid \theta_{1jk}, \theta_{1k}) = \Phi[\alpha_{1i}' (\theta_{1jk} + \theta_{1k}) - \beta_{1i}], \qquad (2)$$

where $y_{1jki} = [y_{1jki1}, \ldots, y_{1jkid}, \ldots, y_{1jkiD}]'$ are item responses across domains at pre-test, 
$\theta_{1jk(D \times 1)} = [\theta_{1jk1}, \ldots, \theta_{1jkd}, \ldots, \theta_{1jkD}]'$ are multidimensional latent variables at level 2, 
$\theta_{1k(D \times 1)} = [\theta_{1k1}, \ldots, \theta_{1kd}, \ldots, \theta_{1kD}]'$ are multidimensional latent variables at level 3, $\alpha_{1i(D \times D)}$ 
are item slopes or item discrimination parameters at pre-test, and $\beta_{1i(D \times 1)}$ are item 
intercepts or item difficulty parameters at pre-test. $\theta_{1jk}$ and $\theta_{1k}$ are assumed to follow a 
multivariate normal distribution, $\theta_{1jk} \sim MN(0_{(D \times 1)}, \Sigma_{3(D \times D)})$ and 
$\theta_{1k} \sim MN(\mu_{(D \times 1)}, \Sigma_{4(D \times D)})$, respectively, where $\mu_{(D \times 1)} = [\mu_1, \ldots, \mu_d, \ldots, \mu_D]'$ are intercepts of 
latent variables (i.e., grand mean). 

A structural model for person parameters at level 2 (e.g., the student level) is as follows: 

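$$\theta_{2jk} = \gamma_0 + \gamma_1 \theta_{1jk} + \sum_{n=1}^{N} \gamma_{(n+1)} Z_{jk.n} + \varepsilon_{2jk}, \qquad (3)$$

where $Z_{jk.n(D \times 1)}$ is the $n$th covariate for an individual $j$ nested within a cluster $k$ at level 2, 
$\gamma_{0(D \times 1)} = [\gamma_{01}, \ldots, \gamma_{0d}, \ldots, \gamma_{0D}]'$ are intercepts at level 2 (fixed to 0s to identify the 
model), $\gamma_{1(D \times D)} = \mathrm{diag}[\gamma_{11}, \ldots, \gamma_{1d}, \ldots, \gamma_{1D}]'$ are the effects of the pre-test score at level 2, 
$\gamma_{(n+1)(D \times D)} = \mathrm{diag}[\gamma_{(n+1)1}, \ldots, \gamma_{(n+1)d}, \ldots, \gamma_{(n+1)D}]'$ are the effects of covariates 
$Z_{jk.n}$, and $\varepsilon_{2jk(D \times 1)} = [\varepsilon_{2jk1}, \ldots, \varepsilon_{2jkd}, \ldots, \varepsilon_{2jkD}]'$ are residuals of post-test latent scores at 
level 2, assumed to follow $MN(0_{(D \times 1)}, \Sigma_{5(D \times D)})$. 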

A structural model for person parameters at level 3 (e.g., the teacher level) is as follows: 

$$\theta_{2k} = \delta_0 + \delta_1 \theta_{1k} + \delta_2 TRT_k + \sum_{m=1}^{M} \delta_{(m+2)} Z_{k.m} + \varepsilon_{2k}, \qquad (4)$$

where $TRT_{k(D \times 1)}$ is a covariate of an intervention condition with a value of 0 for 
members of the control group and a value of 1 for members of the treatment group, $Z_{k.m(D \times 1)}$ 
is the $m$th covariate for a cluster $k$ at level 3, $\delta_{0(D \times 1)} = [\delta_{01}, \ldots, \delta_{0d}, \ldots, \delta_{0D}]'$ are 
intercepts at level 3 (i.e., grand mean), $\delta_{1(D \times D)} = \mathrm{diag}[\delta_{11}, \ldots, \delta_{1d}, \ldots, \delta_{1D}]'$ are the effects 
of the pre-test score at level 3, $\delta_{2(D \times D)} = \mathrm{diag}[\delta_{21}, \ldots, \delta_{2d}, \ldots, \delta_{2D}]'$ are the intervention 
effects at level 3, $\delta_{(m+2)(D \times D)} = \mathrm{diag}[\delta_{(m+2)1}, \ldots, \delta_{(m+2)d}, \ldots, \delta_{(m+2)D}]'$ are the effects of 
covariates $Z_{k.m}$, and $\varepsilon_{2k(D \times 1)} = [\varepsilon_{2k1}, \ldots, \varepsilon_{2kd}, \ldots, \varepsilon_{2kD}]'$ are residuals of post-test latent 
scores at level 3, assumed to follow $\varepsilon_{2k(D \times 1)} \sim MN(0_{(D \times 1)}, \Sigma_{6(D \times D)})$. 

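Adding the two structural models (equations 3 and 4) to the measurement model for a 
post-test (equation 1), the model for correct item responses across domains 
($y_{2jki} = [y_{2jki1}, \ldots, y_{2jkid}, \ldots, y_{2jkiD}]'$) leads to the following: 

$$P(y_{2jki} = 1) = \Phi\Big[\alpha_{2i}' \Big\{ \Big(\gamma_0 + \gamma_1 \theta_{1jk} + \sum_{n=1}^{N} \gamma_{(n+1)} Z_{jk.n} + \varepsilon_{2jk}\Big) + \Big(\delta_0 + \delta_1 \theta_{1k} + \delta_2 TRT_k + \sum_{m=1}^{M} \delta_{(m+2)} Z_{k.m} + \varepsilon_{2k}\Big) \Big\} - \beta_{2i}\Big]. \qquad (5)$$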
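To make the combined model concrete, the following sketch generates binary pre-test and post-test responses from the measurement and structural models for $D = 2$ domains. It is a simplified illustration, not the authors' code: the covariates $Z$ are omitted, the domains are drawn independently (identity correlation), and all parameter values, the item count, and the function name `simulate_mmirm_mlc` are hypothetical.

```python
import numpy as np

def simulate_mmirm_mlc(n_clusters=24, n_per_cluster=18, items_per_domain=5, seed=7):
    """Generate binary pre-test and post-test responses for D = 2 domains."""
    rng = np.random.default_rng(seed)
    D = 2
    item_dim = np.repeat(np.arange(D), items_per_domain)  # between-item design: one domain per item
    n_items = item_dim.size
    # Hypothetical item slopes (alpha) and intercepts (beta) for each occasion.
    alpha1, alpha2 = rng.uniform(0.8, 1.5, (2, n_items))
    beta1, beta2 = rng.normal(0.0, 0.5, (2, n_items))
    # Hypothetical structural coefficients (diagonals of gamma_1, delta_1, delta_2; vector delta_0).
    gamma1 = np.array([0.6, 0.5])
    delta0 = np.array([0.2, 0.1])
    delta1 = np.array([0.7, 0.6])
    delta2 = np.array([0.5, 0.3])  # intervention effect per domain

    cluster = np.repeat(np.arange(n_clusters), n_per_cluster)
    trt = rng.permutation(np.repeat([0, 1], n_clusters // 2))  # cluster-level random assignment
    n = cluster.size

    # Pre-test latent variables and post-test residuals at levels 2 and 3.
    theta1_jk = rng.normal(0.0, 1.0, (n, D))
    theta1_k = rng.normal(0.0, 0.5, (n_clusters, D))
    eps2_jk = rng.normal(0.0, 1.0, (n, D))
    eps2_k = rng.normal(0.0, 0.3, (n_clusters, D))

    theta2_jk = gamma1 * theta1_jk + eps2_jk                                 # equation 3 with gamma_0 = 0
    theta2_k = delta0 + delta1 * theta1_k + delta2 * trt[:, None] + eps2_k  # equation 4

    def respond(theta_jk, theta_k, alpha, beta):
        ability = (theta_jk + theta_k[cluster])[:, item_dim]
        # Probit link via a latent response: y = 1 when alpha * ability - beta + N(0, 1) noise > 0.
        latent = alpha * ability - beta + rng.normal(size=ability.shape)
        return (latent > 0).astype(int)

    y_pre = respond(theta1_jk, theta1_k, alpha1, beta1)
    y_post = respond(theta2_jk, theta2_k, alpha2, beta2)
    return y_pre, y_post, trt[cluster]

y_pre, y_post, trt = simulate_mmirm_mlc()
print(y_pre.shape, y_post.shape, trt.mean())
```

The latent-response formulation is used only to avoid an explicit normal CDF; thresholding at zero with standard normal noise is equivalent to the probit link in equation 1.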

To identify the model, the $\gamma_0$ are set to 0s, and variances in $\Sigma_{3(D \times D)}$ and $\Sigma_{5(D \times D)}$ (i.e., 
variances at the student level for the pre-test and residual variances at the student level for 
the post-test, respectively) are set to 1s. Alternatively, the item discrimination for one of 
the items (e.g., the first item) in each dimension can be set to 1 instead of setting variances 
to 1 to identify the scale unit of the parameters. Variances at the teacher level can be 
estimated for the pre-test and post-test because the same item discriminations are used 
over levels (assuming no cluster bias). See the online supporting information (Appendix 
S1) for a diagram depicting the MMIRM-MLC for person parameters with two domains. 
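The need for a scale constraint can be seen from a short worked identity. The sketch below is written for a single dimension and an arbitrary constant $c > 0$, which is introduced here only for illustration.

```latex
% Scale indeterminacy for one dimension (c > 0 is an arbitrary constant):
\begin{align*}
\Phi\!\left[\alpha_{2i}\,(\theta_{2jk} + \theta_{2k}) - \beta_{2i}\right]
  &= \Phi\!\left[\frac{\alpha_{2i}}{c}\,(c\,\theta_{2jk} + c\,\theta_{2k}) - \beta_{2i}\right], \\
\operatorname{Var}(c\,\theta_{2jk}) &= c^{2}\,\operatorname{Var}(\theta_{2jk}).
\end{align*}
```

Rescaling the latent variables by $c$ and the slope by $1/c$ leaves every response probability unchanged, so either a student-level variance or one slope per dimension has to be fixed. Because the same slope multiplies both levels, fixing the scale at the student level also fixes $c$ for the teacher level, which is why the teacher-level variances remain estimable.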


2.1. Comparisons with other approaches to measurement error adjustment 

Measurement error adjustment using the MMIRM-MLC is different from measurement 
error adjustment methods in previous structural modelling approaches in which specific 
assumptions are made about the distributional structure of the unobserved variables. 
A description of those differences follows. 

First, measurement error in the MMIRM-MLC is adjusted for a response variable and a 
covariate simultaneously, as in Rabe-Hesketh et al. (2004) and Raudenbush and Sampson 
(1999). Specifically, this simultaneous approach allows us to detect the group difference 
on the error-free latent variable scale (i.e., the ability parameter in IRT is equal to an 
(unbiased) estimator minus (random) error) at post-test, by controlling for the possible 
measurement error in the pre-test scores and by mapping pre-test scores and post-test 
scores on the latent variable scales. However, in previous studies, measurement error was 
mainly adjusted for the response variable (e.g., Fox, 2004) or for the covariate (e.g., 
Battauz & Bellio, 2011; Carroll et al., 2006; Fox & Glas, 2003; Goldstein et al., 2008). That 
is, in these previous applications, either a measurement model for the response variable 
(e.g., equation 1 or a classical true score model) or a measurement model for the covariate 
(e.g., equation 2 or a classical true score model) was used. 

Second, a set of multiple items is used to correct for measurement error in the covariate 
using item response models in the MMIRM-MLC (see equations 1 and 2). That is, the set of 
multiple items at level 1 in the MMIRM-MLC is used for correcting for measurement error 
at the individual level and at the cluster level. This approach is different from previous 
approaches to correcting for measurement error in the covariate, including Carroll et al. 
(2006) and Goldstein et al. (2008). These previous studies used a classical true score 
model for total scores (only at the individual level). 

Third, measurement error adjustment in the MMIRM-MLC is done at each level of the 
multilevel data. Specifically, multiple items for each domain (indicated by $d$) are modelled 
for a latent variable at level 2 ($\theta_{1jkd}$) and a latent variable at level 3 ($\theta_{1kd}$) to correct for 
measurement error in the pre-test scores. Further, multiple items for each domain are used 
for a latent variable at level 2 ($\theta_{2jkd}$) and a latent variable at level 3 ($\theta_{2kd}$) to correct for 
measurement error in the post-test scores. The group differences, the intervention effects 
($\delta_2$ in equation 5), can be detected on the error-free latent variable scale, $\theta_{2kd}$. Raudenbush 
and Sampson (1999) used multiple items at level 1 to measure constructs at level 2 within 
level 3 as in the MMIRM-MLC. However, they did not include item discriminations at level 
2 (such as $\alpha_{1i}$ and $\alpha_{2i}$ in the MMIRM-MLC) or regressions among the latent variables (such 
as $\gamma_1$ and $\delta_1$ in the MMIRM-MLC). 


2.2. Measurement invariance test 

In multiple-measurement (or longitudinal) multilevel data arising from multiple groups, 
there are at least three sources of measurement invariance to test: across time, across 
clusters, and across groups (e.g., control and treatment groups). The measurement 
invariance assumption across time points is not necessary when a pre-test score is used as 
a proxy variable for unobserved factors that predict or explain future attributes (e.g., 
Lockwood & McCaffrey, 2014). Further, it is possible that item discrimination(s) can be 
different for an individual-level latent variable and for a cluster-level latent variable in 
multilevel item response models. This possibility, called cluster bias (Jak, Oort, & Dolan, 
2013), can be investigated by testing whether item discriminations are equal over levels. 
Finally, invariance across groups is necessary for comparing group means (Bejar, 1980). 
Item response models to test cluster bias and group bias are described in the online 
supporting information (Appendix S2). 

Two models are compared to test cluster bias: (1) a cluster invariance model, in 
which item discriminations over levels 2 and 3 are the same; and (2) a cluster bias model, 
in which item discriminations over levels 2 and 3 are different. Three invariance models 
are compared to investigate the measurement invariance across groups (e.g., 
Vandenberg & Lance, 2000; Widaman & Reise, 1997): (1) a configural invariance model, in 
which all item parameters are estimated simultaneously in each group under the same 
factor structures; (2) a weak invariance model, in which only discrimination parameters 
are constrained to be equal across groups; and (3) a strong invariance model, in which all 
item parameters are constrained to be equal across groups. 


3. Parameter estimation and model evaluation 

Bayesian analysis was chosen to fit MMIRM-MLCs and the (multigroup) multilevel 
longitudinal item response model to test measurement invariance assumptions. In 
(hierarchical) Bayesian analysis, it is possible to sample complex and high-dimensional 
posterior densities with Markov chain Monte Carlo (MCMC) methods through sampling 
from the conditional distributions of parameters without numerical integration. In this 
study, WinBUGS 1.4.3 (Spiegelhalter, Thomas, Best, & Lunn, 2003) was used to 
implement MCMC. 

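For the MMIRM-MLC, the joint posterior distribution for the parameters 
$\vartheta = \{\alpha_1, \alpha_2, \beta_1, \beta_2, \gamma_0, \gamma_1, \gamma_{(n+1)}, \delta_0, \delta_1, \delta_2, \delta_{(m+2)}, \theta_{1jk}, \theta_{1k}, \varepsilon_{2jk}, \varepsilon_{2k}, \Sigma_3, \Sigma_4, \Sigma_5, \Sigma_6\}$ 
can be rewritten as 

$$\begin{aligned}
P(\vartheta \mid y_{1jki}, y_{2jki}) \propto{} & P(y_{1jki} \mid \vartheta)\, P(y_{2jki} \mid \vartheta) \\
& \cdot \{P(\alpha_1) P(\alpha_2) P(\beta_1) P(\beta_2) P(\gamma_0) P(\gamma_1) P(\gamma_{(n+1)}) P(\delta_0) P(\delta_1) P(\delta_2) P(\delta_{(m+2)})\} \\
& \cdot \{P(\theta_{1jk} \mid 0, \Sigma_3)\, P(\theta_{1k} \mid 0, \Sigma_4)\, P(\varepsilon_{2jk} \mid 0, \Sigma_5)\, P(\varepsilon_{2k} \mid 0, \Sigma_6)\} \\
& \cdot \{P(\Sigma_3) P(\Sigma_4) P(\Sigma_5) P(\Sigma_6)\},
\end{aligned} \qquad (6)$$

where $P(y_{1jki} \mid \vartheta)$ is a likelihood function of item responses across domains for pre-test, 
$P(y_{2jki} \mid \vartheta)$ is a likelihood function of item responses across domains for post-test, the 
probabilities in the first braces indicate prior distributions of fixed parameters, the 
probabilities in the second braces indicate prior distributions of latent variables, and 
the probabilities in the third braces indicate hyperprior distributions of population 
parameters of the latent variables. A similar specification was also applied to the 
(multigroup) multilevel longitudinal item response model to test measurement invariance 
assumptions across clusters and groups. 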

Priors for all fixed effects in a structural model for person parameters (except $\gamma_0$, which is 
fixed to identify the model) and item difficulty parameters were set as $N(0, 0.1)$ in WinBUGS. Item 
discrimination parameters were set to $N(0, 1)$ truncated at 0⁴ for $\alpha_{1i}$ and $\alpha_{2i}$, respectively, 
to have stable item discrimination parameter estimates (e.g., Béguin & Glas, 2001, for the 
normal ogive multidimensional item response model). To match the priors on $\Sigma_{3(D \times D)}$ 
and $\Sigma_{5(D \times D)}$ with the model identification constraints, variances in the 
variance-covariance matrix were set to 1 and the priors on correlation coefficient parameters were 
set to Uniform(−1, 1). Prior and hyperprior distributions for other parameters were 
specified in WinBUGS as follows: 

$$\theta_{1k} \sim MN(0_{(D \times 1)}, \Sigma_{4(D \times D)}),$$

$$\varepsilon_{2k} \sim MN(0_{(D \times 1)}, \Sigma_{6(D \times D)}),$$

$$\Sigma_{4(D \times D)} \sim \mathrm{Wishart}(R, \nu), \quad R = I_D, \quad \nu = D, \text{ and}$$

$$\Sigma_{6(D \times D)} \sim \mathrm{Wishart}(R, \nu), \quad R = I_D, \quad \nu = D.$$

$I_D$ denotes the identity matrix of size $D$, and the degrees of freedom $\nu$ in the Wishart 
distribution are set to $D$ as the rank of $\theta$ and $\varepsilon$ to represent vague prior knowledge (the 
mean and variance in the prior distribution on elements in $\Sigma_{4(D \times D)}$ and $\Sigma_{6(D \times D)}$ are 2 
with $R = I_D$). Similar priors and hyperpriors for fixed parameters and random effects were 
chosen for item and person parameters of the (multigroup) multilevel longitudinal item 
response model to test measurement invariance assumptions across clusters and groups. 

⁴ The specification in WinBUGS is N(0,1)I(0,), where 1 is a variance. 

In order to ensure that stable parameter estimates were obtained, Gelman and Rubin’s 
(1992) method was used as implemented in WinBUGS. Based on the results of the 
convergence checking, initial samples were discarded (‘burn-in’), and posterior means or 
medians and standard deviations (i.e., Bayesian standard errors) were calculated from 
subsequent iterations. 
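For readers unfamiliar with the diagnostic, the sketch below computes the basic potential scale reduction factor of Gelman and Rubin (1992) for one parameter from several chains. It is an illustration only: it omits the degrees-of-freedom correction, it is not the interval-based variant that WinBUGS plots, and the function name `gelman_rubin` and the simulated chains are hypothetical.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for one parameter.

    chains: array of shape (m, n) holding m chains of n post-burn-in draws.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)         # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()   # average within-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled estimate of the posterior variance
    return float(np.sqrt(var_hat / W))

rng = np.random.default_rng(0)
mixed = rng.normal(size=(3, 2000))               # three chains sampling the same distribution
stuck = mixed + np.array([[0.0], [2.0], [4.0]])  # three chains centred at different values
print(gelman_rubin(mixed), gelman_rubin(stuck))
```

Values close to 1 suggest that the chains have mixed; values well above 1 suggest that more iterations or a longer burn-in are needed.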


3.1. Bayesian model fit 

Competing models (i.e., measurement invariance models, unidimensional vs. 
multidimensional model) were compared using a relative fit criterion, the deviance information 
criterion (DIC; Spiegelhalter, Best, Carlin, & van der Linde, 2002; see also Verhagen & Fox, 
2013). A smaller DIC represents a better fit of the model, and a difference of fewer than 5 or 10 units 
between models does not provide sufficient evidence for favouring one model over 
another (Spiegelhalter et al., 2003). The DIC can be calculated easily by specifying the 
log-likelihood along with a model specification in WinBUGS. 
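As a brief sketch of how the DIC is assembled from MCMC output, the function below uses the standard definition of Spiegelhalter et al. (2002): the posterior mean deviance plus the effective number of parameters. The function name `dic` and the deviance values are hypothetical.

```python
import numpy as np

def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, pD) from sampled deviances and the deviance at the posterior means."""
    d_bar = float(np.mean(deviance_draws))    # posterior mean deviance
    p_d = d_bar - deviance_at_posterior_mean  # effective number of parameters
    return d_bar + p_d, p_d

# Made-up deviance draws for two competing models.
dic_a, pd_a = dic([100.0, 102.0, 104.0], 98.0)
dic_b, pd_b = dic([101.0, 103.0, 105.0], 100.0)
print(dic_a, dic_b, abs(dic_a - dic_b) < 5)
```

In this toy comparison the difference is below 5 units, so under the rule of thumb above neither model would be clearly preferred.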

In addition, the adequacy of the fit of the MMIRM is evaluated by comparing observed 
and posterior predictive score frequencies (e.g., Béguin & Glas, 2001) with posterior 
predictive model checking (Rubin, 1984). In addition to the overall model evaluation 
using the posterior predictive frequencies, item fit and person fit were considered as 
individual checks. Standardized residuals (Spiegelhalter, Thomas, Best, & Gilks, 1996) 
were considered as a discrepancy measure. Item fit was calculated as the mean of the 
standardized residuals over persons, and person fit was calculated as the mean of the 
standardized residuals over items. Posterior predictive p-values (ppp-values; Meng, 1994) 
for the person fit (Glas & Meijer, 2003) and item fit (Sinharay, 2005) were calculated. 
Values around .5 indicate that a person or an item fits well to the data while values close to 
0 or 1 indicate misfit (Gelman & Meng, 1996). We consider ppp-values smaller than 
.025 or larger than .975 as extreme values indicative of misfit at the 5% level (e.g., Sinharay, 
2005). 
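The ppp-value computation can be sketched as follows, assuming that a discrepancy measure has already been evaluated for the observed data and for replicated data at each retained MCMC draw. The function names `ppp_value` and `is_misfit` and the numbers are hypothetical.

```python
import numpy as np

def ppp_value(observed_discrepancy, replicated_discrepancy):
    """Proportion of draws in which the replicated discrepancy is at least as large as the observed one."""
    obs = np.asarray(observed_discrepancy, dtype=float)
    rep = np.asarray(replicated_discrepancy, dtype=float)
    return float(np.mean(rep >= obs))

def is_misfit(ppp, lower=0.025, upper=0.975):
    """Flag extreme ppp-values using the cut-offs stated in the text."""
    return ppp < lower or ppp > upper

# Made-up discrepancies at four MCMC draws.
example = ppp_value([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0])
print(example, is_misfit(example))
```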


4. Empirical illustration 

Data for the current study were gathered as part of a larger efficacy trial for enhanced 
anchored instruction (EAI). The experimental instruction was designed to improve the 
mathematics skills of middle and high school students, especially those with learning 
difficulties in maths (MD). The design of the efficacy trial was a pre-test-post-test cluster 
randomized trial. Schools, rather than classes or students, were randomly assigned to EAI 
and business as usual (BAU) because the research team did not have control over the 
students’ class assignment. In this illustration, we evaluate an instructional intervention 
called EAI and its impact on students by showing the effect of intervention using an MMIRM- 
MLC. Our purpose for conducting the analysis was to answer the following question: If the 
intervention effect is detected, is it possible to interpret it across cognitive skill areas? 


4.1. Teacher and student samples 

Twenty-four urban and rural middle schools in the southeastern United States participated 
in the study. Half were randomly assigned to EAI and half to BAU. Each school had one 
participating inclusive maths classroom, although one school had two participating 
classrooms. Teachers in both conditions were comparable in terms of gender (mostly 
female), ethnicity (mostly white), education level (well educated), and years of 
experience (Bottge, Ma, Gassaway, Toland, Butler, & Cho, 2014). In our study, one 
inclusive maths class from each school was sampled, with the exception of one school 
that had two inclusive maths classes. Therefore, a two-level data structure (students 
nested within 25 teachers) was used because there was only one school for which we 
needed to be concerned about clustering at the school level. The smallest number of 
students analysed for a teacher was 7, and the largest was 28. The average cluster size was 
17.84. 

Roughly equal numbers of students in each condition had an identified MD: 62 (28%) of 
223 in EAI and 72 (29%) of 248 in BAU. Of the initial sample, 25 students did not respond to 
all items in the pre-test or post-test. As a result, 232 BAU (29% MD) and 214 EAI (26% MD) 
remained in the final sample. Based on chi-square tests of equal proportions, students 
were comparable across instructional conditions in gender, ethnicity, subsidized lunch, 
and disability area, and teachers were comparable in both conditions in terms of gender 
(mostly female), ethnicity (mostly white), education level (well educated), and years of 
experience (Bottge et al., 2014). Bottge et al. (2014) found no difference between the EAI and BAU 
groups on the pre-test total score scale. 
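The sample figures reported above are mutually consistent, as the following arithmetic check shows. It uses only the counts stated in the text; the variable names are illustrative.

```python
# (students with MD, all students) in the initial sample, as reported in the text
initial = {"EAI": (62, 223), "BAU": (72, 248)}

md_percent = {cond: round(100 * md / total) for cond, (md, total) in initial.items()}
initial_n = sum(total for _, total in initial.values())
final_n = initial_n - 25                 # 25 students with incomplete responses were excluded
average_cluster_size = final_n / 25      # 25 teachers (clusters)

print(md_percent, initial_n, final_n, average_cluster_size)
```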


4.2. Measure: Fraction computation test 

The researcher-developed test, the fraction computation test, administered at the 
pre-test and post-test, was used in the current study to illustrate the MMIRM-MLC. The test 
comprised 20 items assessing students’ ability to manually add and subtract fractions. 
Item features differed in several ways: (1) addition or subtraction; (2) like 
denominators or unlike denominators; (3) simple fractions 
or mixed numbers; and (4) two stacks or three stacks 
(i.e., one more stack than in a two-stack item). There were a total of 42 points on the 
test. For 18 of the 20 items, students could earn 0, 1 or 2 points. On two items, 
students could earn 3 points if they simplified the answer (i.e., revised the fraction to 
simple terms). Inter-rater agreement was 99% on the pre-test and 97% on the post-test. 
Less than 1% of students in the sample received partial scores (i.e., score 1 for 18 of 
the items and scores 1 or 2 for two of the items) on any of the items on the test. Thus, 
in this paper, binary responses were considered, 1 for correct responses and 0 for 
incorrect responses. Partial scores were also considered as incorrect responses. There 
were no missing item responses in the final sample of 446 students for analysis. 
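The recoding rule described above (full credit counts as correct, partial credit as incorrect) can be sketched as follows. The function name `dichotomize` is hypothetical, and because the text does not say which two items carry 3 points, placing them last is an assumption made only for illustration.

```python
def dichotomize(scores, max_points):
    """Recode item scores as 1 for full credit and 0 otherwise (partial credit counts as incorrect)."""
    if len(scores) != len(max_points):
        raise ValueError("scores and max_points must have the same length")
    return [1 if score == full else 0 for score, full in zip(scores, max_points)]

# 18 items worth 2 points and 2 items worth 3 points (positions of the 3-point items are assumed).
max_points = [2] * 18 + [3] * 2
total_points = sum(max_points)

example = dichotomize([2, 1, 0, 3, 2], [2, 2, 2, 3, 3])
print(total_points, example)
```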


4.3. Analysis and results 

To answer the research question using the MMIRM-MLC, our analysis proceeded as 
follows. All code, including the model specification in WinBUGS used in the current 
analyses, are available from the first author upon request. 


4.3.1. Step 1: Determining distinct domains 

As shown in Table 1, each item had four item attributes. In order to find the most distinct 
item feature for domain scoring, we compared a set of exploratory factor analyses using 
polychoric correlations with the Bayes estimator (GIBBS(PX1) option) at each time point 
using Mplus version 7.11 (Muthén & Muthén, 1998-2014). The model fit of the 
exploratory factor analyses was evaluated based on posterior predictive model checking 
with a summary measure of fit, the likelihood-ratio chi-square statistic. Corresponding 
ppp-values were calculated for each factor solution; ppp-values around .5 indicate that the 
observed pattern would likely be seen in replications of the data if the model were true. 


Table 1. Fit Indices from Exploratory Factor Analyses Extracting 1 and 2 Factors and (GEOMIN 
Rotated) Factor Loadings for a 2-Factor Solution. 

Model fit 
                    Pre-test                Post-test 
               1-Factor   2-Factor     1-Factor   2-Factor 
ppp-value        0.399      0.443        0.401      0.453 

Factor loadings 
            Attributes                                Pre-test              Post-test 
Item  Operation    Denominator  Type    Stacks   Factor 1  Factor 2    Factor 1  Factor 2 
  1   Addition     Like         Simple    2        0.640     0.354       0.651     0.080 
  2   Addition     Like         Simple    2        0.610     0.216       0.678     0.201 
  3   Addition     Unlike       Simple    2       -0.135     1.048      -0.265     1.075 
  4   Addition     Unlike       Simple    2       -0.186     1.067      -0.231     1.029 
  5   Addition     Unlike       Simple    2        0.000     0.965      -0.189     1.024 
  6   Addition     Unlike       Simple    2       -0.026     0.985      -0.214     1.054 
  7   Addition     Unlike       Mixed     2       -0.132     1.034       0.007     0.940 
  8   Addition     Unlike       Mixed     2        0.041     0.912       0.087     0.900 
  9   Addition     Unlike       Mixed     2        0.007     0.946       0.062     0.930 
 10   Addition     Unlike       Mixed     2        0.032     0.939       0.007     0.924 
 11   Addition     Unlike       Simple    3        0.089     0.902       0.063     0.886 
 12   Addition     Unlike       Simple    3        0.092     0.902       0.052     0.921 
 13   Addition     Unlike       Mixed     3        0.065     0.901       0.084     0.916 
 14   Addition     Unlike       Mixed     3        0.004     0.924       0.005     0.924 
 15   Subtraction  Like         Simple    2        0.888     0.032       0.883     0.087 
 16   Subtraction  Unlike       Simple    2        0.137     0.871       0.160     0.851 
 17   Subtraction  Like         Mixed     2        0.804     0.097       0.731     0.085 
 18   Subtraction  Unlike       Mixed     2        0.456     0.698       0.002     0.752 
 19   Subtraction  Unlike       Mixed     2        0.235     0.772       0.212     0.804 
 20   Subtraction  Unlike       Mixed     2        0.397     0.557       0.096     0.737 

Note. Bold factor loadings are significant at 5% level. 

Table 1 shows the model fit results at each time point for a one-factor and two-factor 
solution and (GEOMIN rotated) factor loadings for a two-factor solution. According to the 
ppp-values, the one-factor model provided a good fit to the data at 
each time point. However, there was an improvement in model fit with the two-factor 
model at each time point. Shifting from the one-factor to the two-factor model produced 
noteworthy decreases in residual variances for four like items that loaded on the first 
factor, especially at the post-test. Factor loadings were clearly clustered regarding like 
items versus unlike items. As reported in Table 1, the like items (items 1, 2, 15 and 17) 
were highly loaded on factor 1 while the unlike items were highly loaded on factor 2 at the 
pre-test and post-test. Moderate (GEOMIN) factor correlations of .585 and .497 for the 
pre-test and post-test, respectively, indicated that two factors can provide two scores with 
distinct meaning. Based on these results, we chose the two-factor model with a between- 
item design where an item loaded on the like factor or unlike factor for domain scoring. 
When there is evidence of a second dimension on a specific skill domain, having a two- 
factor model yields diagnostic interpretations as compared to a one-factor model. 


4.3.2. Step 2: Selecting the measurement model 

Intraclass correlations (ICCs) were calculated to investigate the multilevel structure of the
data using the data at each time point. The ICC for the observed outcomes for each item
(e.g., Muthén & Asparouhov, 2013) ranged from .058 to .297 for the pre-test and from .071
to .347 for the post-test, based on results of the two-parameter multilevel unidimensional
normal ogive model at each time point. A common rule of thumb is that ICCs over .05
indicate the necessity of multilevel analysis (e.g., Jak et al., 2013). According to the rule of
thumb, there is non-ignorable dependency due to clusters (teachers). Accordingly, the
MMIRM was chosen as a (multilevel) measurement model for the pre-test and post-test.
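
The rule-of-thumb check can be sketched as follows. This is a minimal illustration that uses the generic variance-ratio definition of the ICC; the item-level ICCs in the paper come from the fitted multilevel normal ogive model, so the helper names and the variance-ratio form here are assumptions made for illustration. Only the reported ICC ranges and the .05 cutoff are taken from the text.

```python
def intraclass_correlation(var_between, var_within):
    """Generic ICC: share of total variance that lies between clusters."""
    total = var_between + var_within
    if total <= 0:
        raise ValueError("total variance must be positive")
    return var_between / total


def exceeds_rule_of_thumb(icc_values, cutoff=0.05):
    """True when every ICC is above the cutoff."""
    return all(value > cutoff for value in icc_values)


# Smallest and largest item ICCs reported in the text.
pre_test_range = [0.058, 0.297]
post_test_range = [0.071, 0.347]
multilevel_needed = (exceeds_rule_of_thumb(pre_test_range)
                     and exceeds_rule_of_thumb(post_test_range))
```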

Table 2 reports summary information about the standard deviation (i.e., Bayesian
standard error) of $\theta$ estimates from MMIRM and within and between reliability of the total
scores (Geldhof, Preacher, & Zyphur, 2014) for each domain at pre-test and post-test. This
information presents evidence that there was non-ignorable measurement error on both
the latent variable scale and the total score scale.


4.3.3. Step 3: Checking measurement invariance 

From step 1, a two-factor model was chosen to provide diagnostic interpretations based 
on the specific skill domain, even though there is evidence that the one-factor model fitted 
relatively well compared to the two-factor model. For measurement invariance checking, 
the one-factor model was estimated to check the measurement invariance over the 
clusters (i.e., teachers) and groups (i.e., BAU vs. EAI, non-MD vs. MD). 

Table 2. Measurement error information for IRT scale scores of MMIRM-MLC and for total scores

                              Pre-test            Post-test
                              Like     Unlike     Like     Unlike
IRT-based: Descriptive information for standard deviation of $\theta$ estimates
  Student level
    Mean                      0.80     0.49       0.71     0.45
    SD                        0.12     0.26       0.18     0.24
    Min.                      0.44     0.33       0.43     0.35
    Max.                      0.99     0.75       1.09     0.76
  Teacher level
    Mean                      0.77     0.45       0.58     0.39
    SD                        0.06     0.07       0.05     0.07
    Min.                      0.61     0.62       0.48     0.31
    Max.                      0.83     0.78       0.63     0.52
Total score-based
    Within reliability        0.43     0.69       0.50     0.70
    Between reliability       0.56     0.69       0.51     0.74

Table S1 in the online supporting information presents the measurement models, their
constraints, and DIC values for three invariance models for clusters and groups. The
‘burn-in’ period ranged from 4,000 to 6,000 iterations for invariance models in the MCMC
analyses. Posterior means were used for calculating the DIC. Differences in the DIC values
between cluster bias and cluster invariance models were <5, so the cluster invariance model
was chosen as the simpler model. Given the cluster invariance model, group invariance
was then tested. A weak invariance model was chosen for BAU versus EAI and non-MD
versus MD.
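
The DIC-based choice described in this step can be sketched as a small decision rule: keep the simpler (more constrained) model unless the more complex model lowers the DIC by at least 5. The function name and the DIC values below are hypothetical; only the threshold of 5 comes from the text.

```python
def choose_by_dic(dic_simple, dic_complex, min_gain=5.0):
    """Prefer the complex model only when it lowers the DIC by min_gain or more."""
    return "complex" if dic_simple - dic_complex >= min_gain else "simple"


# Hypothetical DIC values (the actual values are in Table S1 of the paper).
cluster_choice = choose_by_dic(dic_simple=10250.0, dic_complex=10247.5)
```

With a gain of only 2.5, the rule returns the simpler model, which mirrors the choice of the cluster invariance model over the cluster bias model.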

Whether BAU and EAI or non-MD and MD can be scored and compared on the same
scale in the presence of weak invariance violation was checked by comparing the
correlations between the scores from the two MMIRM-MLC models (without any manifest
covariates) with weak invariance and strong invariance assumptions. The scores from the
two MMIRM-MLC models for BAU and EAI and for non-MD and MD were highly correlated
(>.927). This indicates that the relative ordering of persons’ scores did not change much
when weak measurement invariance was ignored for BAU and EAI or non-MD and MD. In
addition, the estimated group mean differences (i.e., BAU and EAI or non-MD and MD)
were similar between the two MMIRM-MLC models with weak invariance and strong
invariance assumptions. Thus, in the following analysis, a strong invariance model for
BAU and EAI or non-MD and MD was assumed.
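
The comparison of scores from the two invariance specifications rests on a plain correlation coefficient. A minimal sketch, assuming the two sets of person scores are available as equal-length lists (the function name and the toy scores are hypothetical):

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical person scores under weak and strong invariance.
scores_weak = [-1.2, -0.4, 0.1, 0.7, 1.5]
scores_strong = [-1.1, -0.5, 0.2, 0.6, 1.6]
r = pearson_r(scores_weak, scores_strong)
```

A correlation close to 1 indicates that the ordering of persons is nearly the same under both specifications, which is the argument used above for adopting the strong invariance model.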


4.3.4. Step 4: Adding covariates to the measurement model and model evaluation

Now the MMIRM-MLC was fitted to answer the research question by adding an
intervention condition covariate to the measurement model. This model is called
MMIRM-MLC model 1. MMIRM-MLC model 2 is MMIRM-MLC model 1 plus student-level and
teacher-level demographic information.

A burn-in of 4,000 iterations was used for all parameters of MMIRM-MLC models 1 and
2, based on Gelman and Rubin’s (1992) statistic with three chains. A further 10,000
post-burn-in iterations were obtained to calculate posterior moments. Monte Carlo errors
for all parameters in all analyses were less than about 5% of the sample standard deviation.
All 95% posterior intervals included the observed data, which indicates that MMIRM-MLC
models 1 and 2 were appropriate for the data. The ppp-values for all items at pre-test and
post-test were between .025 and .075, indicating that the items fitted the data well.
In MMIRM-MLC model 1, there were 8% and 7% of persons with ppp-values >.075 for the
pre-test and post-test, respectively, indicating a misfit in this model. In MMIRM-MLC model
2, there were 8% and 6% of persons with ppp-values >.075 for the pre-test and post-test,
respectively. They all were at the lower end of the score distribution.
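
For readers unfamiliar with the convergence diagnostic named above, the following is a minimal sketch of the basic Gelman and Rubin potential scale reduction factor for a single parameter. It assumes equal-length post-burn-in chains and omits the degrees-of-freedom correction used in some implementations; it is not the WinBUGS code, and the function name is hypothetical.

```python
import statistics


def gelman_rubin(chains):
    """Basic potential scale reduction factor for one parameter.

    chains: list of equal-length lists of draws, one list per chain.
    """
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    chain_means = [statistics.fmean(c) for c in chains]
    # W: average within-chain variance; B: between-chain variance scaled by n.
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Three toy chains that disagree strongly give a factor far above 1.
separated = gelman_rubin([[0, 1, 0, 1], [10, 11, 10, 11], [20, 21, 20, 21]])
```

Values close to 1 suggest the chains have mixed; the burn-in is usually extended until the factor falls near 1 for all parameters.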

Table 3 shows the item parameter estimates of MMIRM-MLC model 1 (see footnote 5). Item
parameter estimates of MMIRM-MLC model 2 were similar to those of MMIRM-MLC model 1.
At the pre-test and post-test, items varied in terms of item discriminations and
difficulties, and items that have a like denominator were less discriminating and less
difficult than items that have an unlike denominator.


4.4. Answers to research question 

Table 4 presents the results of MMIRM-MLC models 1 and 2 for person parameters. The 
change in significance and magnitude of the intervention effect was small when other 
demographic information for the students and teachers was added to MMIRM-MLC models 
1 and 2. For illustration purposes, the results of MMIRM-MLC model 2 with two student- 
level covariates (MD and gender) and one teacher-level covariate (years of teaching special 
education) are shown in Table 4. Covariates were coded as follows: BAU (coded as 0) and 
EAI (coded as 1) groups, non-MD students (coded as 0) and MD students (coded as 1), 
female students (coded as 0) and male students (coded as 1), and mean-centred years of
teaching general education (M = 11.1, SD = 1.7).

In Table 4, ‘L.TRT’ and ‘U.TRT’ represent the estimated difference between the means
of the EAI and BAU post-test scores for a like domain and an unlike domain, respectively,
adjusted for the pre-test scores on the post-test scores. In MMIRM-MLC model 1, the
pre-test score effects on the like domain and the unlike domain were statistically
significant at the student level and at the teacher level. Significant intervention effects
were found for like and unlike domains (i.e., 0.890, credible interval [CI] [0.506, 1.499],
for the like domain and 1.039, CI [0.601, 1.471], for the unlike domain). Specifically, the
EAI group performed 0.890 higher than the BAU group for the like domain and the EAI group
performed 1.039 higher than the BAU group for the unlike domain. In MMIRM-MLC model
2, the effects of student-level and teacher-level covariates were not significant and the
effects of pre-test scores and the group difference between the EAI and BAU groups were
similar to those of MMIRM-MLC model 1.
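
The significance statements above amount to checking whether each 95% credible interval excludes zero. A minimal sketch using the two intervals reported in the text (the function and variable names are hypothetical):

```python
def excludes_zero(lower, upper):
    """True when the interval [lower, upper] lies entirely on one side of zero."""
    return lower > 0.0 or upper < 0.0


# (estimate, lower, upper) for the intervention effect in MMIRM-MLC model 1.
effects = {
    "like": (0.890, 0.506, 1.499),
    "unlike": (1.039, 0.601, 1.471),
}
flags = {domain: excludes_zero(lo, hi) for domain, (_, lo, hi) in effects.items()}
```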


4.5. Result comparisons across different approaches for measurement error treatment 

An MMIRM with multilevel manifest covariate (MMC) and a multilevel model (MM,
specifically a multilevel multivariate random intercept model) with MLC were fitted to
the same empirical data to show the consequences of ignoring measurement error in a
covariate or a response variable. Measurement error in pre-test scores (covariate) is
ignored in the MMIRM-MMC, whereas measurement error in post-test scores (response
variable) is ignored in the MM-MLC. In the MMC of the MMIRM-MMC, pre-test total
scores ($\sum_{i=1}^{I} y_{1jkid} = y_{1jk\cdot d}$) were decomposed into within pre-test total scores
($y_{1jk\cdot d} - \bar{y}_{1\cdot k\cdot d}$, where $\bar{y}_{1\cdot k\cdot d}$ is a cluster mean) and between pre-test total scores
($\bar{y}_{1\cdot k\cdot d}$) for each domain. For the multilevel (linear) model, post-test total scores
($\sum_{i=1}^{I} y_{2jkid} = y_{2jk\cdot d}$) were used for each domain. In these two models, pre-test scores
and an intervention condition were considered covariates as in MMIRM-MLC model 1.
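
The within and between decomposition of the manifest pre-test total scores can be sketched as follows: the between part is the cluster mean and the within part is each student's deviation from that mean. The function name and the toy data are hypothetical, and the sketch handles one domain at a time.

```python
def decompose_by_cluster(scores, clusters):
    """Split total scores into within-cluster deviations and cluster means."""
    if len(scores) != len(clusters):
        raise ValueError("one cluster label is needed per score")
    totals, counts = {}, {}
    for score, cluster in zip(scores, clusters):
        totals[cluster] = totals.get(cluster, 0.0) + score
        counts[cluster] = counts.get(cluster, 0) + 1
    means = {cluster: totals[cluster] / counts[cluster] for cluster in totals}
    between = [means[cluster] for cluster in clusters]
    within = [score - means[cluster] for score, cluster in zip(scores, clusters)]
    return within, between


# Toy pre-test totals for four students nested in two teachers.
within, between = decompose_by_cluster([1.0, 3.0, 4.0, 8.0], ["a", "a", "b", "b"])
```

Because the cluster mean is computed from observed totals, both parts still carry measurement error, which is why this specification treats the covariate as manifest rather than latent.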


5 As shown in equations 1 and 2, the item parameters have an $I \times D$ vector. With a
between-item design, all items have one set of item parameters for the dimension $d$. In
the table, the item parameter estimates are presented in one column for simplicity.



Table 3. Item attributes and item parameter estimates (95% CIs) of MMIRM-MLC model 1

Table 4. Results of empirical study: Parameter estimates and 95% CIs

The MMIRM-MMC and MM-MLC are specified in the online supporting information
(Appendix S3). WinBUGS was used to fit the MMIRM-MMC and MM-MLC with priors and
hyperpriors comparable to those used in MMIRM-MLCs.

Table 4 presents the results for the MMIRM-MMC and MM-MLC. Pre-test effects and
intervention effects were standardized on their relevant scale for comparison among the
three models, MMIRM-MLC model 1, MMIRM-MMC and MM-MLC. Table S2 in the
online supporting information shows the standardized estimates of the three models,
based on results presented in Table 4. Compared to the standardized effects of pre-test
scores in MMIRM-MLC model 1, pre-test effects were underestimated in the MMIRM-MMC
and MM-MLC. The standardized intervention effects were similar between
the MMIRM-MMC and MMIRM-MLC model 1. However, intervention effects were
underestimated in the MM-MLC.


5. Simulation study 

A simulation study was designed to examine parameter recovery of the MMIRM-MLC
under Bayesian estimation using WinBUGS in various multilevel designs when the
population data-generating model is an MMIRM-MLC. In addition, the results of pre-test
effects and intervention condition effects were compared across the MMIRM-MLC,
MMIRM-MMC and MM-MLC to show the consequences of using total scores when the
population data-generating model is MMIRM-MLC model 1. WinBUGS was used to fit the
MMIRM-MMC and MM-MLC (see the online supporting information [Appendix S3] for
a description of these two models).


5.1. Simulation design 

We selected simulation conditions that previous research has found to affect estimates of
person parameters at the cluster level, such as the intervention effect that was the focus
of the empirical research question (e.g., Lüdtke, Marsh, Robitzsch, & Trautwein,
2011; Preacher, Zhang, & Zyphur, 2011). The design includes the number of clusters and
the number of individuals per cluster. The number of clusters was set to $K$ = 24, 50, or
100. Samples of 24 and 50 clusters are common in educational experimental intervention
research, as in our empirical illustration. Large numbers of clusters are found in
national or international educational assessments such as the National Assessment of
Educational Progress and the Trends in International Mathematics and Science Study;
accordingly, a condition with 100 clusters was included. Unlike in our empirical study,
balanced cluster sizes were considered to investigate the effect of cluster sizes,
including $n_k$ = 5, 20, or 50, as used in other multilevel studies (e.g., Preacher
et al., 2011). A cluster size of 5 is found in small group designs (e.g., Kenny, Mannetti,
Pierro, Livi, & Kashy, 2002). Crossing the number of clusters with the number of
individuals per cluster gives nine different total sample sizes: 120, 250, 480, 500,
1,000, 1,200, 2,000, 2,500, or 5,000. One hundred replications were simulated for each of
the nine different multilevel designs.
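
The nine designs follow from crossing the three numbers of clusters with the three cluster sizes. A short sketch that enumerates them and reproduces the total sample sizes listed above (the variable names are hypothetical):

```python
from itertools import product

cluster_counts = (24, 50, 100)   # number of clusters, K
cluster_sizes = (5, 20, 50)      # individuals per cluster, n_k

# One (K, n_k, total) triple per simulated design.
designs = [(k, n, k * n) for k, n in product(cluster_counts, cluster_sizes)]
total_sizes = sorted(total for _, _, total in designs)
replications_per_design = 100
total_datasets = len(designs) * replications_per_design
```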

The same number of clusters were assigned to be either a control group or a treatment
group for a balanced design. As in the empirical study, a 20-item test with a between-item
design was considered: four items for domain 1 and 16 items for domain 2. The item
parameter estimates and person parameter estimates, including an intervention effect that
we obtained in the empirical study, were considered as the true parameters of MMIRM-MLC
model 1, as reported in Table 4. The ICC varied across items between .058 and .297 at
the pre-test and between .016 and .347 at the post-test.


5.2. Result hypotheses 

The MMIRM-MLC was expected to yield the least bias because the population
data-generating model was the MMIRM-MLC. The effects of pre-test scores ($\gamma_1$ and $\delta_1$) and
the effect of intervention ($\delta_2$) are expected to be different, depending on the treatment
of measurement error in pre-test scores and post-test scores. In the presence of
measurement error in pre-test scores as in the MMIRM-MMC, the effects of pre-test scores are
expected to be biased (e.g., Lüdtke et al., 2011). However, the effects of intervention
conditions are not expected to be biased in the presence of measurement error in pre-test
scores when there is no intervention effect at pre-test (Cho & Preacher, 2015). In the
presence of measurement error in response variables as in the MM-MLC, both the effect of
pre-test scores and the effects of intervention can be biased (e.g., Fox, 2004).
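
The expected downward bias in the pre-test effect can be illustrated with the classical attenuation result for a single-level linear regression: when a covariate is measured with independent error, the expected slope shrinks by the reliability of the covariate. The sketch below is a simplified stand-in for the multilevel IRT setting of this paper, and all names and numbers are hypothetical.

```python
def reliability(var_true, var_error):
    """Share of observed-score variance that is true-score variance."""
    return var_true / (var_true + var_error)


def attenuated_slope(true_slope, var_true, var_error):
    """Expected slope when the covariate carries independent measurement error."""
    return reliability(var_true, var_error) * true_slope


# With equal true-score and error variance, the slope is halved.
halved = attenuated_slope(true_slope=1.0, var_true=1.0, var_error=1.0)
```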


5.3. Analysis 

The same priors specified earlier were used in the MMIRM-MLC, and priors and
hyperpriors comparable to those in the MMIRM-MLC were used for the MMIRM-MMC and MM-MLC.
Gelman and Rubin’s (1992) statistic was used to evaluate convergence with three chains.
One replication of each condition was used for convergence checking. No convergence
problems were encountered in any replications for the MMIRM-MLC, except in the sample
size condition $n_k$ = 5 and $K$ = 24 (total sample size = 120). This non-convergence
problem may arise because the sample size is too small to estimate 98 parameters (80 item
parameters [20 items × 4 kinds], 10 structural parameters, and 8 variance or covariance
parameters). Only the converged results are reported below. There were no convergence
problems in the MMIRM-MMC or the MM-MLC. A burn-in of 5,000 iterations was used for
all parameters in the MMIRM-MLC, and a burn-in of 4,000 iterations was used for all
parameters in the MMIRM-MMC and MM-MLC. The same burn-in was set for the other
replications in each condition. An additional 6,000 iterations were obtained to estimate
the posterior moments in the MMIRM-MLC, MMIRM-MMC and MM-MLC. Monte Carlo
errors for all parameters were less than about 5% of the sample standard deviation in all
three models.
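
The Monte Carlo error criterion can be checked along the following lines. This sketch estimates the Monte Carlo standard error with the batch-means method, which is one common choice and is an assumption here, since the paper does not state how the error was computed; the function names are hypothetical.

```python
import statistics


def batch_means_mcse(draws, n_batches=20):
    """Monte Carlo standard error of the posterior mean via batch means."""
    size = len(draws) // n_batches
    if n_batches < 2 or size < 1:
        raise ValueError("need at least two non-empty batches")
    means = [statistics.fmean(draws[b * size:(b + 1) * size])
             for b in range(n_batches)]
    return statistics.stdev(means) / n_batches ** 0.5


def mc_error_ok(draws, max_ratio=0.05, n_batches=20):
    """True when the Monte Carlo error is at most max_ratio of the posterior SD."""
    return batch_means_mcse(draws, n_batches) <= max_ratio * statistics.stdev(draws)


# A strongly trending chain of 6,000 draws fails the 5% criterion.
trending_ok = mc_error_ok(list(range(6000)))
```

When the check fails, the usual remedy is to run more post-burn-in iterations or to thin the chain.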

Percentage relative bias was calculated to show the accuracy of the parameter
estimates from the MMIRM-MLC, MMIRM-MMC and MM-MLC. Taking $\delta$ as an example, it is
given by $100 \times [(\hat{\delta} - \delta)/\delta]$. Before calculating percentage relative bias, estimates
of the three models were standardized (see Table S2 in the online supporting information
for the calculation of standardized estimates of the three models based on the empirical
results in Table 4).
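
The bias measure can be computed as in the sketch below. Averaging the estimates over replications before applying the formula is an assumption, as the text gives only the formula for a single estimate; the 15% limit is the acceptability criterion used later in the discussion, and the function names and toy values are hypothetical.

```python
def percentage_relative_bias(estimates, true_value):
    """100 * (mean estimate - true value) / true value."""
    if not estimates or true_value == 0:
        raise ValueError("need estimates and a non-zero true value")
    mean_estimate = sum(estimates) / len(estimates)
    return 100.0 * (mean_estimate - true_value) / true_value


def is_acceptable(bias, limit=15.0):
    """True when the absolute percentage relative bias is within the limit."""
    return abs(bias) <= limit


# Toy standardized estimates from two replications with a true value of 1.0.
bias = percentage_relative_bias([0.8, 1.0], true_value=1.0)
```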


5.4. Simulation results 

For the analysis of the MMIRM-MMC, intervention effects were first tested on pre-test total 
scores by adding a covariate of an intervention condition to the MMC of the MMIRM-MMC. 
No significant intervention effects were found in any conditions in the MMIRM-MMC, 
based on a 95% CI test.

Table 5 presents the percentage relative bias for pre-test effects ($\gamma_1$ and $\delta_1$) and
intervention effects ($\delta_2$) for the MMIRM-MLC, MMIRM-MMC and MM-MLC. The following
overall patterns in percentage relative bias were observed, as reported in Table 5. First, the
percentage relative bias for pre-test effects was much lower for the MMIRM-MLC than for
the MMIRM-MMC and MM-MLC in all conditions, and it was lower for the MMIRM-MMC
than for the MM-MLC. For the pre-test effect estimate at the individual level ($\gamma_1$),
percentage relative bias ranged in magnitude from 0.3 to 18.6 in the MMIRM-MLC, from
−85.1 to −29.9 in the MMIRM-MMC and from −116.0 to −52.7 in the MM-MLC. For the
pre-test effect estimate at the cluster level ($\delta_1$), the percentage relative bias ranged in
magnitude from 0.3 to 15.8 in the MMIRM-MLC, from −70.8 to −19.1 in the MMIRM-MMC
and from −338.4 to −44.3 in the MM-MLC.

Second, the percentage relative bias for intervention effects was similar between the
MMIRM-MLC and MMIRM-MMC, except for the condition with $K$ = 24 and $n_k$ = 5. The
percentage relative bias ranged in magnitude from −12.0 to 0.2 in the MMIRM-MLC and
from −14.6 to 0.9 in the MMIRM-MMC. However, the percentage relative bias for
intervention effects in the MM-MLC was much larger than in the MMIRM-MLC and
MMIRM-MMC. It ranged in magnitude from −107.4 to −49.2.

Third, overall, the percentage relative bias decreased with increasing cluster size ($n_k$)
and number of clusters ($K$) for pre-test effects ($\gamma_1$ and $\delta_1$) and intervention effects ($\delta_2$) in all
three models, although there were three conditions that did not have that pattern: $n_k$ = 50
in the MMIRM-MMC for $\delta_{11}$, $n_k$ = 50 in the MM-MLC for $\delta_{11}$, and $n_k$ = 50 in the MM-MLC
for $\delta_{22}$.

Table 6 reports the percentage relative bias for fixed parameter estimates (for fixed
parameter estimates not reported in Table 5) and population parameter estimates of
random (residual) effects in the MMIRM-MLC. The degree of bias decreased as the cluster
size ($n_k$) and number of clusters ($K$) increased for all parameter estimates. Unlike the item
parameters and population parameters of random (residual) effects, the $\delta_0$ in the
MMIRM-MLC tended to be underestimated when $K$ and $n_k$ decreased.


6. Summary and discussion 

This paper has specified the model for detecting the intervention effect when MMIRMs 
were used for explicit measurement error modelling in the use of pre-test and post-test 
scores. The main application of the MMIRM-MLC presented in this paper was to detect a 
more diagnostic intervention effect by estimating the intervention effect for each domain. 
In the empirical illustration, a four-step analysis was implemented for applying the 
MMIRM-MLC to an instructional intervention study in a pre-test-post-test cluster 
randomized trial. 

In the simulation study, the accuracy of parameter estimates for the MMIRM-MLC was
investigated in various multilevel designs including a design similar to the empirical study.
Parameter accuracy for all parameters was acceptable (with an acceptable bias
criterion set to 15%) in all conditions considered in this study for the MMIRM-MLC, except
for conditions with a small cluster size ($n_k$ = 5). When using the total scores as a covariate
(i.e., pre-test scores) as in the MMIRM-MMC, unacceptable bias was found in pre-test
effects, whereas acceptable bias was found in intervention effects, except for a condition
with a small cluster size ($n_k$ = 5) and a small number of clusters ($K$ = 24). This finding
indicates that measurement error in a covariate may not be problematic in detecting
intervention effects on post-test when there is no intervention effect on pre-test, which
is often the case in cluster randomized trials. On the other hand, in the presence of
measurement error in response variables (i.e., post-test scores), unacceptable bias can be
found in both pre-test effects and intervention effects.


Table 5. Results of simulation study: Comparisons of pre-test effects and treatment effects of MMIRM-MLC, MMIRM-MMC and MM-MLC

Table 6. Results of simulation study: Percentage relative bias of MMIRM-MLC

We now discuss the limitations of the current study and future work. First, the unique
application in the current study was to detect the intervention effect for each domain for 
its diagnostic value. Sinharay (2010) showed that subscores should meet strict standards 
of reliability, and that weak correlation between domain scores has added value in terms of 
mean square error in estimating the true subscore. Reliable subscores should be obtained 
to make valid inferences about scores in the subtest domains. Lack of sufficient reliability 
is a concern when there are a small number of items for a domain. For the current 
application, a moderate correlation coefficient between two domain scores was found at 
the student level (.691), and a small correlation coefficient was found at the teacher level 
(.099). However, there were four items for the like domain and 16 items for the unlike 
domain. Thus, the reliability of the subscore for the like domain can be questioned from a 
value-added perspective. 

Second, there were two main sources of measurement bias in the empirical study: 
clusters and groups. We first tested cluster invariance and then tested the possibilities of 
measurement invariance across the groups based on the results of the cluster invariance 
test. It is important to note that this is not the only possible sequence for testing invariance. For example,
measurement invariance across groups can be tested first, and then cluster bias can be 
investigated. Jak et al. (2013) stated that there is no universally optimal procedure in most 
situations, and different procedures generally identify the same items as being biased, but 
the power to detect bias may vary. A comparison study with alternative procedures is 
needed to determine the Type I error and power to detect measurement invariance in the 
use of the DIC. 

Third, the simulation study has the same limitations as other simulation studies, that is, 
the conditions we considered are limited because the simulation study was mainly 
designed to check the accuracy of parameter estimates in various multilevel designs. The 
limited conditions include the true parameters and the ICC found from the empirical 
study, the number of items for each domain, and the balanced design. More extensive 
simulations that vary the limited conditions should be conducted to make solid 
generalizations. 

Fourth, one may think that a comparison among the MMIRM-MLC, MMIRM-MMC and
MM-MLC approaches is unfair when the population data-generating model is the
MMIRM-MLC. However, we chose the MMIRM-MLC as the population data-generating model for
two main reasons. First, our main interest in comparing the three models was to
investigate the extent to which pre-test effects on total scores or intervention effects on
total scores may produce misleading inferences on an error-free latent construct. In
addition, we were interested in the degree to which the MMIRM-MLC would outperform
the MMIRM-MMC and MM-MLC even though it may be obvious that the MMIRM-MLC
would perform better than the MMIRM-MMC and MM-MLC overall in this situation. Still,
there was no guarantee that the MMIRM-MLC would recover its own parameters well even
when the MMIRM-MLC was the population data-generating model. Indeed, the
MMIRM-MLC would not converge while the MMIRM-MMC would when the sample size was small
($K$ = 24 and $n_k$ = 5).

To conclude, the present study focused on the empirical illustration of the MMIRM-MLC
and its evaluation using Bayesian analysis. When measurement error is a concern
in using a response variable and a covariate, the MMIRM-MLC can be an analytic tool for
detecting an intervention effect in pre-test-post-test cluster randomized trials. However,
given the results of the simulation study, the MMIRM-MLC should be used only when both
the number of clusters and the cluster size are large enough.



Acknowledgements 

The first author received the following financial support for the research, authorship, and/or 
publication of this article: 2013 National Academy of Education/Spencer Postdoctoral 
Fellowship. The data used in the paper were collected with the following support: US 
Department of Education, Institute of Education Sciences, PR Number H324A090179. Any 
opinions, findings, or conclusions are those of the first author and do not necessarily reflect the 
views of the supporting agencies. 


References 

Aitkin, M., & Longford, N. (1986). Statistical modelling issues in school effectiveness studies (with
discussion). Journal of the Royal Statistical Society, Series A, 149, 1-42.

Battauz, M., & Bellio, R. (2011). Structural modeling of measurement error in generalized linear
models with Rasch measures as covariate. Psychometrika, 76, 40-56. doi:10.1007/s11336-010-9195-z

Béguin, A. A., & Glas, C. A. W. (2001). MCMC estimation of multidimensional IRT models.
Psychometrika, 66, 541-562. doi:10.1007/BF02296195

Bejar, I. I. (1980). Biased assessment of program impact due to psychometric artifacts. Psychological
Bulletin, 87, 513-524. doi:10.1037/0033-2909.87.3.513

Bottge, B. A., Ma, X., Gassaway, L., Toland, M. D., Butler, M., & Cho, S.-J. (2014). Effects of blended
instructional models on math performance. Exceptional Children, 80, 423-437.
doi:10.1177/0014402914527240

Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models in social and behavioral
research: Applications and data analysis methods. Newbury Park, CA: Sage.

Carroll, R. J., Ruppert, D., Stefanski, L. A., & Crainiceanu, C. M. (2006). Measurement error in
nonlinear models: A modern perspective. Boca Raton, FL: Chapman & Hall/CRC.

Cho, S.-J., & Preacher, K. J. (2015). Measurement error correction formula for group differences in
a cluster-randomized design. Unpublished manuscript, Psychological Sciences, Vanderbilt
University.

Culpepper, S. A., & Aguinis, H. (2011). Using analysis of covariance (ANCOVA) with fallible
covariates. Psychological Methods, 16, 166-178. doi:10.1037/a0023355

De Boeck, P., & Wilson, M. (2004). Explanatory item response models: A generalized linear and
nonlinear approach. New York, NY: Springer.

de la Torre, J., Song, H., & Hong, Y. (2011). A comparison of four methods of IRT subscoring. Applied
Psychological Measurement, 35, 296-316. doi:10.1177/0146621610378653

Fox, J.-P. (2004). Modelling response error in school effectiveness research. Statistica Neerlandica,
58, 138-160. doi:10.1046/j.0039-0402.2003.00253.x

Fox, J.-P., & Glas, C. A. W. (2003). Bayesian modeling of measurement error in predictor variables
using item response theory. Psychometrika, 68, 169-191. doi:10.1007/BF02294796

Geldhof, G. J., Preacher, K. J., & Zyphur, M. J. (2014). Reliability estimation in a multilevel
confirmatory factor analysis framework. Psychological Methods, 19, 72-91.

Gelman, A., & Meng, X.-L. (1996). Model checking and model improvement. In W. R. Gilks,
S. R. Richardson, and D. J. Spiegelhalter (Eds.), Markov chain Monte Carlo in practice (pp. 189-
201). London, UK: Chapman & Hall.

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences.
Statistical Science, 7, 457-472. doi:10.1214/ss/1177011136

Glas, C. A. W., & Meijer, R. R. (2003). A Bayesian approach to person fit analysis in item response
theory model. Applied Psychological Measurement, 27, 217-233.
doi:10.1177/0146621603027003003

Goldstein, H. (2003). Multilevel statistical models (3rd ed.). London, UK: Edward Arnold.



Goldstein, H., Kounali, D., & Robinson, A. (2008). Modelling measurement errors and category
misclassification in multilevel models. Statistical Modelling, 8, 243-261.
doi:10.1177/1471082X0800800302

Jak, S., Oort, F. J., & Dolan, C. V. (2013). A test for cluster bias: Detecting violations of measurement
invariance across clusters in multilevel data. Structural Equation Modeling, 20, 265-282.
doi:10.1080/10705511.2013.769392

Kenny, D., Mannetti, L., Pierro, A., Livi, S., & Kashy, D. (2002). The statistical analysis of data from
small groups. Journal of Personality and Social Psychology, 83, 126-137.
doi:10.1037/0022-3514.83.1.126

Lockwood, J. R., & McCaffrey, D. F. (2014). Correcting for test score measurement error in ANCOVA
models for estimating treatment effects. Journal of Educational and Behavioral Statistics, 39,
22-52. doi:10.3102/1076998613509405

Lüdtke, O., Marsh, H. W., Robitzsch, A., & Trautwein, U. (2011). A 2 × 2 taxonomy of multilevel latent
contextual models: Accuracy-bias trade-offs in full and partial error correction models.
Psychological Methods, 16, 444-467. doi:10.1037/a0024376

McDonald, R. P. (1993). A general model for two-level data with responses missing at random.
Psychometrika, 58, 575-585. doi:10.1007/BF02294828

Meng, X.-L. (1994). Posterior predictive p-values. Annals of Statistics, 22, 1142-1160.
doi:10.1214/aos/1176325622

Muthén, B. O. (1994). Multilevel covariance structure analysis. Sociological Methods and Research,
22, 376-398. doi:10.1177/0049124194022003006

Muthén, B. O., & Asparouhov, T. (2013). Item response modeling in Mplus: A multi-dimensional,
multi-level, and multi-time point example. In W. J. van der Linden & R. K. Hambleton (Eds.),
Handbook of item response theory, models, statistical tools, and applications. Boca Raton, FL:
Chapman & Hall/CRC Press.

Muthén, L. K., & Muthén, B. O. (1998-2014). Mplus [Computer program]. Los Angeles, CA: Author.

Porter, A. C., & Raudenbush, S. W. (1987). Analysis of covariance: Its model and use in
psychological research. Journal of Counseling Psychology, 34, 383-392.
doi:10.1037/0022-0167.34.4.383

Preacher, K. J., Zhang, Z., & Zyphur, M. J. (2011). Alternative methods for assessing mediation in
multilevel data: The advantages of multilevel SEM. Structural Equation Modeling, 18, 161-182.
doi:10.1080/10705511.2011.557329

Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2004). Generalized multilevel structural equation
modeling. Psychometrika, 69, 167-190. doi:10.1007/BF02295939

Raudenbush, S. W. (1997). Statistical analysis and optimal design for cluster randomized trials.
Psychological Methods, 2, 173-185. doi:10.1037/1082-989X.2.2.173

Raudenbush, S. W., & Sampson, R. (1999). Assessing direct and indirect effects in multilevel designs
with latent variables. Sociological Methods and Research, 28, 123-153.
doi:10.1177/0049124199028002001

Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied
statistician. Annals of Statistics, 12, 1151-1172. doi:10.1214/aos/1176346785

Sinharay, S. (2005). Assessing fit of unidimensional item response models using a Bayesian
approach. Journal of Educational Measurement, 42, 375-394.
doi:10.1111/j.1745-3984.2005.00021.x

Sinharay, S. (2010). How often do subscores have added value? Results from operational and
simulated data. Journal of Educational Measurement, 47, 150-174.
doi:10.1111/j.1745-3984.2010.00106.x

Spiegelhalter, D. J., Thomas, A., Best, N. G., & Gilks, W. R. (1996). BUGS: Bayesian inference using
Gibbs sampling, version 0.5 (version ii). Cambridge, UK: MRC Biostatistics Unit.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model
complexity and fit. Journal of the Royal Statistical Society, Series B, 64, 583-616.
doi:10.1111/1467-9868.00353




Spiegelhalter, D. J., Thomas, A., Best, N. G., & Lunn, D. (2003). WinBUGS user manual. Cambridge, 
UK: MRC Biostatistics Unit, Institute of Public Health. Retrieved from 
http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/manual14.pdf

Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance 
literature: Suggestions, practices, and recommendations for organizational research. 
Organizational Research Methods, 3, 4-69. doi: 10.1177/109442810031002 

Verhagen, J., & Fox, J.-P. (2013). Longitudinal measurement in health-related surveys. A Bayesian 
joint growth model for multivariate ordinal responses. Statistics in Medicine, 32, 2988-3006. 
doi:10.1002/sim.5692 

Widaman, K. F., & Reise, S. P. (1997). Exploring the measurement invariance of psychological 
instruments: Applications in the substance use domain. In K. J. Bryant, M. Windle, & S. G. West 
(Eds.), The science of prevention: Methodological advances from alcohol and substance abuse 
research (pp. 281-324). Washington, DC: American Psychological Association. 

Received 25 June 2014; revised version received 3 October 2014 


Supporting Information 

The following supporting information may be found in the online edition of the article: 

Appendix S1. Diagram of person parameters of MMIRM-MLC. 

Appendix S2. Item response models for invariance tests. 

Appendix S3. Model description of MMIRM-MMC and MM-MLC. 

Table S1. Results of measurement invariance tests using DIC. 

Table S2. Model comparisons: Standardized estimates of pre-test effects and intervention 
effects among MMIRM-MMC, MM-MLC, and MMIRM-MLC model 1. 



Copyright of British Journal of Mathematical & Statistical Psychology is the property of 
Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a 
listserv without the copyright holder's express written permission. However, users may print, 
download, or email articles for individual use. 



